Natural Conversation


Voice-based AI Agents: Filling the Economic Gaps in Digital Health Delivery

Wen, Bo, Wang, Chen, Han, Qiwei, Norel, Raquel, Liu, Julia, Stappenbeck, Thaddeus, Rogers, Jeffrey L.

arXiv.org Artificial Intelligence

The integration of voice-based AI agents in healthcare presents a transformative opportunity to bridge economic and accessibility gaps in digital health delivery. This paper explores the role of large language model (LLM)-powered voice assistants in enhancing preventive care and continuous patient monitoring, particularly in underserved populations. Drawing insights from the development and pilot study of Agent PULSE (Patient Understanding and Liaison Support Engine), a collaborative initiative between IBM Research, Cleveland Clinic Foundation, and Morehouse School of Medicine, we present an economic model demonstrating how AI agents can provide cost-effective healthcare services where human intervention is economically unfeasible. Our pilot study with 33 inflammatory bowel disease patients revealed that 70% expressed acceptance of AI-driven monitoring, with 37% preferring it over traditional modalities. Technical challenges, including real-time conversational AI processing, integration with healthcare systems, and privacy compliance, are analyzed alongside policy considerations surrounding regulation, bias mitigation, and patient autonomy. Our findings suggest that AI-driven voice agents not only enhance healthcare scalability and efficiency but also improve patient engagement and accessibility. For healthcare executives, our cost-utility analysis demonstrates substantial potential savings for routine monitoring tasks, while technologists can leverage our framework to prioritize improvements yielding the highest patient impact. By addressing current limitations and aligning AI development with ethical and regulatory frameworks, voice-based AI agents can serve as a critical entry point for equitable, sustainable digital healthcare solutions.

Healthcare systems worldwide face growing challenges in allocating limited medical resources to meet increasing demand [1], [2]. Traditional healthcare delivery models, centered on episodic patient-provider interactions, often result in significant gaps in continuous care, particularly in preventive health monitoring and chronic disease management [2], [3]. These shortcomings disproportionately affect vulnerable populations, including those with limited access to healthcare facilities [4], lower technological literacy [5], or socio-economic constraints [6]. The advent of Large Language Models (LLMs) and multi-modal AI has opened new avenues for digital health applications [7]-[10], notably in voice-based patient engagement [11], [12]. Unlike earlier rule-based conversational agents, modern AI-driven voice assistants can facilitate context-aware, adaptive, and natural conversations that dynamically adjust to user preferences, health literacy levels, and immediate needs [13]. Voice, as humanity's most intuitive mode of communication, reduces engagement barriers and broadens access to healthcare, especially for underserved communities [12], [14].
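The paper's cost-utility argument can be illustrated with a toy per-patient cost model. All figures below (per-minute rates, call frequency) are hypothetical placeholders, not numbers from the study; the sketch only shows the shape of the comparison between human-led and agent-led routine monitoring.

```python
# Toy cost-utility sketch for routine monitoring (hypothetical figures,
# not taken from the Agent PULSE study).

def monthly_monitoring_cost(calls_per_month, minutes_per_call,
                            cost_per_minute, fixed_overhead=0.0):
    """Per-patient monthly cost of one monitoring channel."""
    return calls_per_month * minutes_per_call * cost_per_minute + fixed_overhead

# Assumed: four 10-minute check-ins per month; a nurse call at $1.00/min
# versus an AI voice agent at $0.05/min (speech + LLM inference).
nurse_cost = monthly_monitoring_cost(4, 10, 1.00)
agent_cost = monthly_monitoring_cost(4, 10, 0.05)
savings_pct = 100 * (nurse_cost - agent_cost) / nurse_cost
print(f"nurse: ${nurse_cost:.2f}, agent: ${agent_cost:.2f}, savings: {savings_pct:.0f}%")
# → nurse: $40.00, agent: $2.00, savings: 95%
```

Under these placeholder rates the agent handles routine check-ins at a small fraction of the human cost, which is the economic gap the paper argues such agents can fill.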


Fox News AI Newsletter: Amazing breakthrough for paralyzed man who can't speak

FOX News

VOICE BREAKTHROUGH: When someone loses the ability to speak because of a neurological condition like ALS, the impact goes far beyond words. Now, thanks to a team at the University of California, Davis, there's a new brain-computer interface (BCI) system that's opening up real-time, natural conversation for people who can't speak. It translates the brain signals that would normally control the muscles used for speech, allowing users to "talk" and even "sing" through a computer, almost instantly. JOBS ON THE LINE: If you've ordered food on Uber Eats recently, you may have seen a delivery robot instead of a human driver.


Paralyzed man speaks and sings with AI brain-computer interface

FOX News

When someone loses the ability to speak because of a neurological condition like ALS, the impact goes far beyond words. Now, thanks to a team at the University of California, Davis, there's a new brain-computer interface (BCI) system that's opening up real-time, natural conversation for people who can't speak. It translates the brain signals that would normally control the muscles used for speech, allowing users to "talk" and even "sing" through a computer, almost instantly.


CASPER: A Large Scale Spontaneous Speech Dataset

Xiao, Cihan, Liang, Ruixing, Zhang, Xiangyu, Tiryaki, Mehmet Emre, Bae, Veronica, Shankar, Lavanya, Yang, Rong, Poon, Ethan, Dupoux, Emmanuel, Khudanpur, Sanjeev, Perera, Leibny Paola Garcia

arXiv.org Artificial Intelligence

The majority (67.79%) reported speaking US English, reflecting the dataset's primary demographic. However, a significant proportion of non-native and regionally influenced English varieties are also present, including Chinese Mandarin-influenced English (4.81%), UK English (5.29%), and Indian English (2.88%). Additionally, 14.42% of participants did not specify an accent, indicating either an omission or variability in self-identification. The participants' accent and native language are based on their self-identification, for example, the number of speakers with an Arabic accent may differ from the number with Arabic as their native language. Age distribution reveals that younger speakers are over-represented, with 57.21% of participants in the 18-29 age range and 23.56% in the 30-39 range.
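As a quick sanity check on the self-reported accent shares quoted above, the named categories plus the unspecified group leave a remainder covering all other varieties (such as the Arabic-accented speakers the abstract mentions):

```python
# Self-reported accent shares from the dataset summary (percent of speakers).
shares = {
    "US English": 67.79,
    "Chinese Mandarin-influenced English": 4.81,
    "UK English": 5.29,
    "Indian English": 2.88,
    "unspecified": 14.42,
}
other = 100.0 - sum(shares.values())  # share left for all remaining varieties
print(f"other English varieties: {other:.2f}%")
# → other English varieties: 4.81%
```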


Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation

Neural Information Processing Systems

Recent work shows promising results in expanding the capabilities of large language models (LLM) to directly understand and synthesize speech. However, an LLM-based strategy for modeling spoken dialogs remains elusive, calling for further investigation. This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech without relying on explicit automatic speech recognition (ASR) or text-to-speech (TTS) systems. We have verified the inclusion of prosody in speech tokens that predominantly contain semantic information and have used this foundation to construct a prosody-infused speech-text model. Additionally, we propose a generalized speech-text pretraining scheme that enhances the capture of cross-modal semantics.
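The abstract describes the speech-text pretraining scheme only at a high level. Purely as an illustration of the general idea of cross-modal sequences (our assumption about the flavor of such schemes, not USDM's published recipe), one can interleave spans of discrete speech-unit tokens with transcript text so the model sees both modalities in one context:

```python
# Toy interleaving of text words and discrete speech-unit tokens.
# Which spans are rendered as speech units is an arbitrary choice here;
# real schemes would use alignments and sampling strategies.

def interleave(words, unit_spans, speak_every=2):
    """Emit text for most words; speech units for every `speak_every`-th word."""
    seq = []
    for i, (word, units) in enumerate(zip(words, unit_spans)):
        if i % speak_every == 1:
            seq += [f"<u{u}>" for u in units]   # discrete speech-unit tokens
        else:
            seq.append(word)                    # plain text token
    return seq

words = ["good", "morning", "how", "are", "you"]
units = [[12, 40], [7, 7, 19], [3], [88, 2], [5, 5, 61]]
print(interleave(words, units))
# → ['good', '<u7>', '<u7>', '<u19>', 'how', '<u88>', '<u2>', 'you']
```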


SER_AMPEL: a multi-source dataset for speech emotion recognition of Italian older adults

Grossi, Alessandra, Gasparini, Francesca

arXiv.org Artificial Intelligence

In this paper, SER_AMPEL, a multi-source dataset for speech emotion recognition (SER), is presented. The dataset's distinguishing feature is that it was collected to provide a reference for speech emotion recognition for Italian older adults. It was assembled following different protocols: acted conversations extracted from movies and TV series, and recordings of natural conversations in which emotions are elicited by appropriate questions. The need for such a dataset emerges from an analysis of the state of the art. Preliminary considerations on the critical issues of SER are reported, based on classification results for a subset of the proposed dataset.


Kids will soon be able to have natural conversations with Alexa

Engadget

Amazon used its annual hardware event on Wednesday to go all-in on Alexa's new large language model-infused capabilities, touting how easy it'll soon be to have a natural sounding conversation with the bot. This also extends to kids, as the company just announced Explore With Alexa. This is a pared-down and kid-friendly version of the updated chatbot that specializes in topics like animals and nature. It'll even play trivia games with your tykes and dispense daily fun facts. Of course, this is for kids, so the tech has been developed with guardrails to protect them from the more sinister parts of the Internet.


ChatSonic - Like ChatGPT but with superpowers

#artificialintelligence

ChatGPT is a conversational AI system created by OpenAI, the research company co-founded by Sam Altman. It is powered by a neural network that has been trained on millions of conversations. It is designed to understand natural language and respond in a meaningful way. The system is based on the GPT-3 family of large-scale language models developed by OpenAI, which have been trained on hundreds of billions of words from the internet. The model is used to generate text responses to user input in a conversational manner.


Turn-Taking Prediction for Natural Conversational Speech

Chang, Shuo-yiin, Li, Bo, Sainath, Tara N., Zhang, Chao, Strohman, Trevor, Liang, Qiao, He, Yanzhang

arXiv.org Artificial Intelligence

While streaming voice assistant systems are used in many applications, they typically focus on unnatural, one-shot interactions, assuming input from a single voice query without hesitation or disfluency. However, a common conversational utterance often involves multiple queries with turn-taking, in addition to disfluencies. These disfluencies include pausing to think, hesitations, word lengthening, filled pauses and repeated phrases. This makes speech recognition for conversational speech, including utterances with multiple queries, a challenging task. To better model the conversational interaction, it is critical to discriminate disfluencies from the end of a query, so that the user can hold the floor during disfluencies while the system responds as quickly as possible once the user has finished speaking. In this paper, we present a turn-taking predictor built on top of an end-to-end (E2E) speech recognizer. Our best system is obtained by jointly optimizing for the ASR task and for detecting when the user has paused to think or finished speaking. The proposed approach achieves over 97% recall and 85% precision in predicting true turn-taking with only 100 ms latency, on a test set designed with 4 types of disfluencies inserted into conversational utterances.
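The paper's predictor is a learned model jointly trained with the E2E recognizer. Purely for contrast, the kind of naive baseline it improves on, a fixed silence threshold with a filler-word hold, can be sketched as follows (the filler list and threshold are illustrative choices, not the paper's):

```python
# Naive turn-taking baseline: declare end-of-query after a long enough
# silence gap, but hold the floor when the last token is a filled pause.
# This is a contrast to the paper's learned predictor, not its method.
FILLERS = {"uh", "um", "hmm", "er"}

def end_of_query(tokens, silence_ms, gap_threshold_ms=700):
    """tokens: recognized words so far; silence_ms: trailing silence length."""
    if not tokens:
        return False
    if tokens[-1].lower() in FILLERS:      # filled pause: user still thinking
        return False
    return silence_ms >= gap_threshold_ms  # plain pause long enough to respond

print(end_of_query(["set", "a", "timer", "um"], 900))  # False: filler holds floor
print(end_of_query(["set", "a", "timer"], 900))        # True: user finished
print(end_of_query(["set", "a", "timer"], 200))        # False: brief pause
```

A fixed threshold forces a trade-off between latency and cutting users off mid-disfluency, which is exactly why the paper predicts turn-taking jointly with recognition instead.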


AWS Touts Partners' Conversational AI Solutions

#artificialintelligence

Amazon Web Services is putting the focus on partners' conversational artificial intelligence (CAI) solutions that could spell the end of organizations' customers screaming "representative" to an interactive voice response phone system or getting stuck in a dead-end or circular digital chat loop. AWS is highlighting solutions from consulting partners including Cation Consulting, Deloitte Consulting, Quantiphi and TensorIoT and technology partners including NLX, ServisBOT and XAPP AI that allow organizations to deploy chatbots, virtual assistants and interactive voice response systems that incorporate AWS artificial intelligence and machine learning services. Their solutions employ services including Amazon Kendra, a machine learning-powered search tool that allows users to search unstructured text using natural language; Amazon Lex, a service for building conversational interfaces into applications using voice and text; and Amazon Polly, a text-to-speech service that converts text into lifelike speech. The new partner initiative comes as the demand for CAI interfaces continues to grow, according to Arte Merritt, who leads AWS partnerships for contact center intelligence and conversational AI. End-customers increasingly prefer to interact with businesses on digital channels, and businesses want to increase user satisfaction, reduce operational costs and streamline business processes, Merritt said in a blog post.